ANU BDSI workshop
Data Wrangling with R Part 1
Biological Data Science Institute
8th April 2024
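Definition of tidy data (Wickham, 2014)
Each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
dplyr, tidyr and ggplot2 are downstream packages for working with tidy data.
Wickham (2014) Tidy Data. Journal of Statistical Software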
reshape first released on CRAN on 2005-08-05
reshape2 released on CRAN on 2010-09-10
tidyr released on CRAN on 2014-07-21
*tidyr v1.0.0 released on 2019-09-12
Wide to long: reshape::melt, reshape2::melt, tidyr::gather, tidyr::pivot_longer*
Long to wide: reshape::cast, reshape2::dcast, tidyr::spread, tidyr::pivot_wider*
Hadley Wickham (2020). tidyr: Tidy Messy Data. R package version 1.1.2.
Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20
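To make the wide-to-long lineage above concrete, here is a minimal sketch comparing the older `tidyr::gather` with `tidyr::pivot_longer`. The data frame `scores_wide` and its column names are hypothetical and made up for illustration.

```r
library(tidyr)

# Hypothetical toy data: one row per id, one column per time point
scores_wide <- data.frame(
  id = c("a", "b"),
  t1 = c(1, 2),
  t2 = c(3, 4)
)

# Older interface, still available in tidyr
long_gather <- gather(scores_wide, key = "time", value = "score", t1, t2)

# Interface introduced in tidyr 1.0.0
long_pivot <- pivot_longer(scores_wide, cols = c(t1, t2),
                           names_to = "time", values_to = "score")

# The two results hold the same rows in a different order,
# so sort one of them before comparing
long_gather <- long_gather[order(long_gather$id, long_gather$time), ]
all.equal(as.data.frame(long_pivot), long_gather, check.attributes = FALSE)
```

The argument names differ (`key`/`value` versus `names_to`/`values_to`), but the reshaping idea is the same.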
tidyverse packages are often labelled with a lifecycle badge.
Lionel Henry (2020). lifecycle: Manage the Life Cycle of your Package Functions. R package version 0.2.0.
tidyr Part 1

| state | 2019 | 2018 | 2017 |
|---|---|---|---|
| NSW | 8130159 | 80366651 | 7919815 |
| VIC | 6655284 | 6528601 | 6387081 |
| ACT | 427892 | 423169 | 415874 |

| state | year | population |
|---|---|---|
| NSW | 2019 | 8130159 |
| NSW | 2018 | 80366651 |
| NSW | 2017 | 7919815 |
| VIC | 2019 | 6655284 |
| VIC | 2018 | 6528601 |
| VIC | 2017 | 6387081 |
| ACT | 2019 | 427892 |
| ACT | 2018 | 423169 |
| ACT | 2017 | 415874 |
Values adapted from Australian Bureau of Statistics. (2020). Table 04. Estimated Resident Population, States and Territories [Time series spreadsheet]. National, state and territory population, Australia Mar 2020. Retrieved Nov 24, 2020. https://www.abs.gov.au/statistics/people/population/national-state-and-territory-population/mar-2020/310104.xls
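The step from the first table to the second can be sketched with `tidyr::pivot_longer`. This is a minimal example under assumptions: the object names `pop_wide` and `pop_long` are hypothetical, and only the VIC and ACT rows are copied from the table to keep it short.

```r
library(tidyr)

# Two rows copied from the wide table; check.names = FALSE keeps the
# column names as "2019", "2018", "2017" instead of "X2019", ...
pop_wide <- data.frame(
  state  = c("VIC", "ACT"),
  `2019` = c(6655284, 427892),
  `2018` = c(6528601, 423169),
  `2017` = c(6387081, 415874),
  check.names = FALSE
)

# Wide to long: every column except `state` is stacked into
# a `year` column (old column name) and a `population` column (cell value)
pop_long <- pivot_longer(pop_wide, cols = -state,
                         names_to = "year", values_to = "population")
pop_long
```

The `year` column comes out as character because it is built from column names. Convert it afterwards if a numeric year is needed.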
tidyr Part 2

```r
library(tidyr)  # provides pivot_wider()

yield_long <- data.frame(year = c(1900, 1900, 2000, 1900, 1900, 2000, 2000),
                         state = c("Iowa", "Kansas", "Kansas", "Iowa", "Kansas", "Iowa", "Kansas"),
                         crop = c("barley", "barley", "barley", "wheat", "wheat", "wheat", "wheat"),
                         yield = c(28.5, 18, 35, 14.4, 18.2, 47, 37))
# Long to wide: one column per crop, named "<crop>_yield"
yield_wide <- pivot_wider(yield_long, names_from = crop, values_from = yield, names_glue = "{crop}_yield")
```

| year | state | crop | yield |
|---|---|---|---|
| 1900 | Iowa | barley | 28.5 |
| 1900 | Kansas | barley | 18.0 |
| 2000 | Kansas | barley | 35.0 |
| 1900 | Iowa | wheat | 14.4 |
| 1900 | Kansas | wheat | 18.2 |
| 2000 | Iowa | wheat | 47.0 |
| 2000 | Kansas | wheat | 37.0 |
United States Department of Agriculture, National Agricultural Statistics Service. http://quickstats.nass.usda.gov/
Kevin Wright (2020). agridat: Agricultural Datasets. R package version 1.17
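One thing worth checking after a long-to-wide reshape is what happens to combinations that are absent from the long data. The sketch below rebuilds the `yield_long` data from this slide and inspects the result. The checks at the end are an illustration, not part of the original slides.

```r
library(tidyr)

yield_long <- data.frame(
  year  = c(1900, 1900, 2000, 1900, 1900, 2000, 2000),
  state = c("Iowa", "Kansas", "Kansas", "Iowa", "Kansas", "Iowa", "Kansas"),
  crop  = c("barley", "barley", "barley", "wheat", "wheat", "wheat", "wheat"),
  yield = c(28.5, 18, 35, 14.4, 18.2, 47, 37)
)

yield_wide <- pivot_wider(yield_long, names_from = crop,
                          values_from = yield, names_glue = "{crop}_yield")

# One row per (year, state) pair, one column per crop
names(yield_wide)

# There is no barley row for Iowa in 2000 in the long data,
# so that cell is filled with NA in the wide data
yield_wide[is.na(yield_wide$barley_yield), ]
```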
tidyr Part 4

crop_long

| year | state | crop | metric | value |
|---|---|---|---|---|
| 1900 | Iowa | barley | yield | 28.5 |
| 1900 | Iowa | barley | acres | 620,000.0 |
| 1900 | Kansas | barley | yield | 18.0 |
| 1900 | Kansas | barley | acres | 127,000.0 |
| 2000 | Kansas | barley | yield | 35.0 |
| 2000 | Kansas | barley | acres | 7,000.0 |
| 1900 | Iowa | wheat | yield | 14.4 |
| 1900 | Iowa | wheat | acres | 1,450,000.0 |
| 1900 | Kansas | wheat | yield | 18.2 |
| 1900 | Kansas | wheat | acres | 4,290,000.0 |
| 2000 | Iowa | wheat | yield | 47.0 |
| 2000 | Iowa | wheat | acres | 18,000.0 |
| 2000 | Kansas | wheat | yield | 37.0 |
| 2000 | Kansas | wheat | acres | 9,400,000.0 |

crop_wide

| year | state | barley_yield | wheat_yield | barley_acres | wheat_acres |
|---|---|---|---|---|---|
| 1900 | Iowa | 28.5 | 14.4 | 620000 | 1450000 |
| 1900 | Kansas | 18.0 | 18.2 | 127000 | 4290000 |
| 2000 | Kansas | 35.0 | 37.0 | 7000 | 9400000 |
| 2000 | Iowa | NA | 47.0 | NA | 18000 |
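One possible way to go from `crop_long` to `crop_wide` is to pass two columns to `names_from` in `pivot_wider`. This is a sketch under assumptions, not necessarily the call used in the workshop, and the column order it produces may differ from the table shown.

```r
library(tidyr)

# crop_long rebuilt from the table, with values as plain numbers
crop_long <- data.frame(
  year   = rep(c(1900, 1900, 2000, 1900, 1900, 2000, 2000), each = 2),
  state  = rep(c("Iowa", "Kansas", "Kansas", "Iowa", "Kansas", "Iowa", "Kansas"), each = 2),
  crop   = rep(c("barley", "wheat"), times = c(6, 8)),
  metric = rep(c("yield", "acres"), times = 7),
  value  = c(28.5, 620000, 18, 127000, 35, 7000,
             14.4, 1450000, 18.2, 4290000, 47, 18000, 37, 9400000)
)

# Long to wide: new column names are built as "<crop>_<metric>"
# (names_sep defaults to "_")
crop_wide <- pivot_wider(crop_long,
                         names_from  = c(crop, metric),
                         values_from = value)
crop_wide
```

Iowa has no barley rows for 2000 in `crop_long`, so `barley_yield` and `barley_acres` are `NA` for that row.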
tidyr Part 5

crop_wide

| year | state | barley_yield | wheat_yield | barley_acres | wheat_acres |
|---|---|---|---|---|---|
| 1900 | Iowa | 28.5 | 14.4 | 620000 | 1450000 |
| 1900 | Kansas | 18.0 | 18.2 | 127000 | 4290000 |
| 2000 | Kansas | 35.0 | 37.0 | 7000 | 9400000 |
| 2000 | Iowa | NA | 47.0 | NA | 18000 |

crop_long

| year | state | crop | metric | value |
|---|---|---|---|---|
| 1900 | Iowa | barley | yield | 28.5 |
| 1900 | Iowa | barley | acres | 620,000.0 |
| 1900 | Kansas | barley | yield | 18.0 |
| 1900 | Kansas | barley | acres | 127,000.0 |
| 2000 | Kansas | barley | yield | 35.0 |
| 2000 | Kansas | barley | acres | 7,000.0 |
| 1900 | Iowa | wheat | yield | 14.4 |
| 1900 | Iowa | wheat | acres | 1,450,000.0 |
| 1900 | Kansas | wheat | yield | 18.2 |
| 1900 | Kansas | wheat | acres | 4,290,000.0 |
| 2000 | Iowa | wheat | yield | 47.0 |
| 2000 | Iowa | wheat | acres | 18,000.0 |
| 2000 | Kansas | wheat | yield | 37.0 |
| 2000 | Kansas | wheat | acres | 9,400,000.0 |
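The reverse direction, from `crop_wide` back to `crop_long`, can be sketched with `pivot_longer` by splitting each column name into two parts. This is one possible call under assumptions, not necessarily the one used in the workshop.

```r
library(tidyr)

# crop_wide rebuilt from the table
crop_wide <- data.frame(
  year         = c(1900, 1900, 2000, 2000),
  state        = c("Iowa", "Kansas", "Kansas", "Iowa"),
  barley_yield = c(28.5, 18, 35, NA),
  wheat_yield  = c(14.4, 18.2, 37, 47),
  barley_acres = c(620000, 127000, 7000, NA),
  wheat_acres  = c(1450000, 4290000, 9400000, 18000)
)

# Wide to long: a name such as "barley_yield" is split at "_"
# into crop = "barley" and metric = "yield"
crop_long <- pivot_longer(
  crop_wide,
  cols           = -c(year, state),
  names_to       = c("crop", "metric"),
  names_sep      = "_",
  values_to      = "value",
  values_drop_na = TRUE  # drop the two NA cells for Iowa barley in 2000
)
crop_long
```

With `values_drop_na = TRUE` the result has 14 rows, matching the long table. Without it, the two `NA` cells would appear as explicit rows.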
| package | maintainer |
|---|---|
| dplyr | Hadley Wickham |
| magrittr | Lionel Henry |
| rlang | Lionel Henry |
| stringr | Hadley Wickham |
| tibble | Kirill Müller |
| tidyr | Hadley Wickham |
| tidyselect | Lionel Henry |
🎯 Separate the maintainer name into two columns: first name and last name
```
# A tibble: 7 × 3
  package    first_name last_name
  <chr>      <chr>      <chr>
1 dplyr      Hadley     Wickham
2 magrittr   Lionel     Henry
3 rlang      Lionel     Henry
4 stringr    Hadley     Wickham
5 tibble     Kirill     Müller
6 tidyr      Hadley     Wickham
7 tidyselect Lionel     Henry
```
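A minimal sketch of one way to reach this target uses `tidyr::separate`. The object name `maintainer_dat` is hypothetical, and the data is entered as a plain data frame, so the printed result will be a data frame rather than a tibble.

```r
library(tidyr)

# Hypothetical name for the package/maintainer table shown earlier
maintainer_dat <- data.frame(
  package    = c("dplyr", "magrittr", "rlang", "stringr",
                 "tibble", "tidyr", "tidyselect"),
  maintainer = c("Hadley Wickham", "Lionel Henry", "Lionel Henry",
                 "Hadley Wickham", "Kirill Müller", "Hadley Wickham",
                 "Lionel Henry")
)

# Split `maintainer` at the space into two new columns;
# the original column is dropped by default
maintainer_split <- separate(maintainer_dat, maintainer,
                             into = c("first_name", "last_name"),
                             sep = " ")
maintainer_split
```

This works here because every name has exactly two parts. Recent tidyr versions mark `separate()` as superseded in favour of `separate_wider_delim()`, though `separate()` remains available.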
```r
library(tibble)  # tribble()
library(dplyr)   # arrange() and the %>% pipe

# One row per package, with all authors in a single comma-separated string
author_dat <- tribble(~package, ~author,
                      "dplyr", "Hadley Wickham, Romain François, Lionel Henry, Kirill Müller",
                      "magrittr", "Lionel Henry, Stefan Milton Bache, Hadley Wickham",
                      "tidyr", "Hadley Wickham",
                      "stringr", "Hadley Wickham",
                      "rlang", "Lionel Henry, Hadley Wickham",
                      "tibble", "Kirill Müller, Hadley Wickham",
                      "tidyselect", "Lionel Henry, Hadley Wickham") %>%
  arrange(package)
```

| package | author |
|---|---|
| dplyr | Hadley Wickham, Romain François, Lionel Henry, Kirill Müller |
| magrittr | Lionel Henry, Stefan Milton Bache, Hadley Wickham |
| rlang | Lionel Henry, Hadley Wickham |
| stringr | Hadley Wickham |
| tibble | Kirill Müller, Hadley Wickham |
| tidyr | Hadley Wickham |
| tidyselect | Lionel Henry, Hadley Wickham |
```
# A tibble: 15 × 2
   package    author
   <chr>      <chr>
 1 dplyr      Hadley Wickham
 2 dplyr      Romain François
 3 dplyr      Lionel Henry
 4 dplyr      Kirill Müller
 5 magrittr   Lionel Henry
 6 magrittr   Stefan Milton Bache
 7 magrittr   Hadley Wickham
 8 rlang      Lionel Henry
 9 rlang      Hadley Wickham
10 stringr    Hadley Wickham
11 tibble     Kirill Müller
12 tibble     Hadley Wickham
13 tidyr      Hadley Wickham
14 tidyselect Lionel Henry
15 tidyselect Hadley Wickham
```
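One way to get from the comma-separated `author` column to one row per author is `tidyr::separate_rows`. The sketch below uses a hypothetical three-package subset named `authors_small` to stay short. It is an illustration, not necessarily the call used in the workshop.

```r
library(tidyr)

# Hypothetical subset of the author table (three packages only)
authors_small <- data.frame(
  package = c("dplyr", "rlang", "tidyr"),
  author  = c("Hadley Wickham, Romain François, Lionel Henry, Kirill Müller",
              "Lionel Henry, Hadley Wickham",
              "Hadley Wickham")
)

# Split `author` at ", " and give each author its own row;
# the `package` value is repeated for every author of that package
authors_long <- separate_rows(authors_small, author, sep = ", ")
authors_long
```

The subset yields 4 + 2 + 1 = 7 rows. Applied to the full table, the same call should give the 15 rows shown above.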
anu-bdsi.github.io/workshop-data-wrangling-R1/